In this lab, we’ll look at how to implement two unsupervised classification methods: \(k\)-means clustering and self-organizing maps (SOMs). We’ll also briefly demonstrate hierarchical clustering along the way.
The data we will use is from the Gapminder project. While you can download the individual variables from the web site, we will use a preprocessed set of the data covering the period 1800-2018, in the file gapminder_1800_2018.csv. Download this to your datafiles folder and make a new folder for today’s class called lab11. You will also need the shapefile of country borders: ne_50m_admin_0_countries.shp (you should have this from a previous lab).
Start RStudio and set the working directory to this folder (go to the [Session] menu, select [Set working directory] -> [Choose directory…], then browse to the folder).
You will need the following packages for today’s lab (tidyr, dplyr, ggpubr, RColorBrewer, sf, countrycode, tmap, and kohonen), so make sure to install anything that is missing before proceeding.
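If it helps, the following sketch installs any of these that are missing (the package list matches the libraries loaded later in the lab):
# Install any of the lab's packages that are not already present
pkgs <- c("tidyr", "dplyr", "ggpubr", "RColorBrewer", "sf",
          "countrycode", "tmap", "kohonen")
install.packages(setdiff(pkgs, rownames(installed.packages())))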
If you are using Python for today’s lab, you’ll need to download the Jupyter notebook for this lab (GEOG_5160_6160_lab08.ipynb). Make a new folder for the lab (lab08) and move the notebook to this. Now open a new terminal (in Windows go to the [Start Menu] > [Anaconda (64-bit)] > [Anaconda prompt]).
You will need to make sure the following packages are installed on your computer (in addition to the packages we have used in previous labs).
conda install country_converter
pip install minisom
Once opened, change directory to the folder you just made, and activate your conda environment:
conda activate geog5160
Start the Jupyter Notebook server:
jupyter notebook
And open the notebook for today’s class.
Once you have set everything up, we can start by reading in the data.
gap_df = read.csv("../datafiles/gapminder_1800_2018.csv")
head(gap_df)
## country year child_mortality fertility per_cap_co2 income life_expectancy
## 1 Afghanistan 1800 469 7 NA 603 28.2
## 2 Afghanistan 1801 469 7 NA 603 28.2
## 3 Afghanistan 1802 469 7 NA 603 28.2
## 4 Afghanistan 1803 469 7 NA 603 28.2
## 5 Afghanistan 1804 469 7 NA 603 28.2
## 6 Afghanistan 1805 469 7 NA 603 28.2
## population
## 1 3280000
## 2 3280000
## 3 3280000
## 4 3280000
## 5 3280000
## 6 3280000
Now load the following libraries to process and visualize the data:
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggpubr)
## Loading required package: ggplot2
library(RColorBrewer)
We can now make some plots to look at the data. First, histograms of the six variables in the data (child mortality, fertility, per capita CO2 emissions, income, life expectancy, and population):
names(gap_df)
## [1] "country" "year" "child_mortality" "fertility"
## [5] "per_cap_co2" "income" "life_expectancy" "population"
h1 <- gap_df %>%
gghistogram(x = "child_mortality")
h2 <- gap_df %>%
gghistogram(x = "fertility")
h3 <- gap_df %>%
gghistogram(x = "per_cap_co2")
h4 <- gap_df %>%
gghistogram(x = "income")
h5 <- gap_df %>%
gghistogram(x = "life_expectancy")
h6 <- gap_df %>%
gghistogram(x = "population")
ggarrange(h1, h2, h3, h4, h5, h6)
The histograms clearly show the skew in most of the variables, and we will need to transform the data before analyzing it. We’ll do this in two steps. First, we’ll log transform all variables to remove the skew, then use the scale() function, which in R scales data to \(z\)-scores by subtracting the mean and dividing by the standard deviation. To simplify things, we do all this in a single step by nesting the log() function in the scale() function.
Note that we remove rows with missing observations (we could potentially impute these, but that is a little more difficult for time series data). We also remove any row where the per capita CO2 emissions are zero, partly because zeros would cause problems when log-transforming and partly because true zero emissions are clearly unrealistic.
gap_df_scale <- gap_df %>%
drop_na() %>%
filter(per_cap_co2 > 0) %>%
mutate(
child_mortality = scale(log(child_mortality)),
fertility = scale(log(fertility)),
per_cap_co2 = scale(log(per_cap_co2)),
income = scale(log(income)),
life_expectancy = scale(log(life_expectancy)),
population = scale(log(population))
)
The last preparation step is to extract a single year of data (2018) for the first part of the lab:
gap_df_scale2 <- gap_df_scale %>%
filter(year == 2018)
We’ll next use these data with the \(k\)-means cluster algorithm. The default function for this in R is kmeans(); we need to pass it the data to be clustered and the number of requested clusters. First, let’s figure out which columns we need:
names(gap_df_scale2)
## [1] "country" "year" "child_mortality" "fertility"
## [5] "per_cap_co2" "income" "life_expectancy" "population"
gap_kmeans = kmeans(gap_df_scale2[,3:8], centers = 6)
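Note that kmeans() starts from randomly chosen centers, so the cluster labels and sizes will vary between runs. If you want a reproducible and more stable solution, you can set a seed and use several random starts (a sketch; the seed and nstart = 25 are arbitrary choices):
set.seed(1234)   # fix the random starting centers for reproducibility
gap_kmeans = kmeans(gap_df_scale2[,3:8], centers = 6, nstart = 25)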
The output of this algorithm provides information about the cluster solution. We can see the size of each cluster (the number of observations assigned to each cluster):
gap_kmeans$size
## [1] 16 39 25 28 35 41
And we can see the cluster assignments for each country:
gap_kmeans$cluster
## [1] 6 4 5 6 4 3 4 3 2 5 4 2 5 4 2 2 4 6 4 5 2 4 3 4 2 6 6 5 6 3 4 6 6 3 3 5 1
## [38] 6 6 2 6 2 2 2 2 2 1 5 5 5 5 1 6 2 1 6 4 2 3 1 6 4 3 6 2 4 5 6 6 4 6 5 2 2
## [75] 5 5 3 5 2 2 3 4 3 5 3 6 1 2 5 5 2 2 1 6 2 2 2 6 6 3 4 6 4 6 4 3 1 4 4 4 5
## [112] 6 5 1 5 2 2 5 6 6 5 4 2 2 5 1 4 6 5 5 5 3 2 2 3 3 6 1 1 3 6 2 4 6 2 2 2 1
## [149] 6 5 3 6 3 5 4 4 6 4 2 2 5 6 6 3 1 6 1 4 5 3 5 6 3 2 3 3 2 5 1 5 5 6 6 6
And finally the cluster centers. These are the prototypes: the average value of each variable across all the observations assigned to a given cluster (note these are the log-transformed and scaled data):
gap_kmeans$centers
## child_mortality fertility per_cap_co2 income life_expectancy
## 1 -0.6589995 -0.2025178 0.04389658 0.07661139 0.7068365
## 2 -2.3412976 -1.7022978 1.04896180 1.78660959 1.2456758
## 3 -2.0578481 -1.6717077 1.06017066 1.60093748 1.2013266
## 4 -1.4535825 -1.4561627 0.67303104 1.04391315 1.0433825
## 5 -1.0349622 -0.9885902 0.38493502 0.47277532 0.9844161
## 6 -0.2143117 0.2657261 -0.57872982 -0.69277091 0.5368623
## population
## 1 -1.18770387
## 2 -0.06597956
## 3 1.39410191
## 4 -1.09818973
## 5 0.91695570
## 6 0.61744335
Here, we can see that cluster 6, for example, has the lowest life expectancy and income, as well as the highest child mortality and fertility rates.
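The choice of six clusters is somewhat arbitrary. A common heuristic is the 'elbow' plot: run kmeans() for a range of values of k and plot the total within-cluster sum of squares, looking for the point where adding more clusters stops paying off (a minimal sketch):
# Total within-cluster sum of squares for k = 1 to 10
set.seed(1234)
wss <- sapply(1:10, function(k)
  kmeans(gap_df_scale2[,3:8], centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters (k)",
     ylab = "Total within-cluster SS")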
As these are spatial data, we can now plot the distribution of the clusters. First, we’ll load the world shapefile in the ne_50m_admin_0_countries folder (we used this in a previous lab):
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
borders = st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp")
## Reading layer `ne_50m_admin_0_countries' from data source `/Users/u0784726/Dropbox/Data/devtools/geog5160/datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp' using driver `ESRI Shapefile'
## Simple feature collection with 241 features and 94 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -180 ymin: -89.99893 xmax: 180 ymax: 83.59961
## geographic CRS: WGS 84
Next, we add the cluster assignments to the gap_df_scale2 data frame:
gap_df_scale2$kmeans <- as.factor(gap_kmeans$cluster)
Now we need to merge the gap_df_scale2 data frame with the world borders spatial object. To do this, we add a new column to the data frame containing the ISO3 standard country codes, using the countrycode package (you’ll need to install this).
library(countrycode)
gap_df_scale2 <- gap_df_scale2 %>%
mutate(ISO3 = countrycode(country, "country.name", "iso3c"))
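countrycode() will warn about any country names it cannot match, so it is worth checking for failures before merging (a quick check):
sum(is.na(gap_df_scale2$ISO3))   # number of countries without an ISO3 match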
Finally, we use these ISO codes to merge the two datasets. We need to specify the column name for each dataset that contains the label to be used in merging, which we do with by.x and by.y.
cluster_sf = merge(borders, gap_df_scale2, by.x = "ADM0_A3", by.y = "ISO3")
We can now plot out the clusters using the tmap package, which highlights the position of the poorer cluster 6 in Saharan and sub-Saharan Africa:
library(tmap)
tm_shape(cluster_sf) +
tm_fill("kmeans", palette = "Set2") +
tm_legend(legend.position = c("left", "bottom"))
We’ll briefly demonstrate the use of hierarchical clustering here. This works a little differently in R, as it requires you to first calculate a dissimilarity matrix (the aggregate difference between each pair of observations), and then use this for clustering.
The R function dist() will calculate this, based on simple Euclidean dissimilarity between samples:
gap_d <- dist(gap_df_scale2[,3:8])
We can now use this to carry out hierarchical clustering using the hclust() function:
gap_hclust <- hclust(gap_d)
And you can then plot the resulting dendrogram with:
plot(gap_hclust)
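By default, the dendrogram leaves are labeled with row numbers. To label them with country names instead, you can pass the labels argument (a small sketch; cex shrinks the labels so they fit):
plot(gap_hclust, labels = gap_df_scale2$country, cex = 0.5)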
To actually get a cluster assignment for each observation, we need to ‘cut’ the tree at a level that results in the desired number of groups, using the well-named cutree() function. To get 6 clusters:
gap_cluster <- cutree(gap_hclust, 6)
head(gap_cluster)
## [1] 1 2 3 4 5 3
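A quick way to see how many countries fall in each hierarchical cluster:
table(gap_cluster)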
To see what this has done, you can overlay the clusters on the dendrogram as follows:
plot(gap_hclust)
rect.hclust(gap_hclust, k = 6)
The clusters that you obtain from the cutree() function can be used in the same way that we used the \(k\)-means clusters. Here, we’ll append them to the spatial object (cluster_sf) and map them out:
gap_df_scale2$hclust <- as.factor(gap_cluster)
cluster_sf = merge(borders, gap_df_scale2, by.x = "ADM0_A3", by.y = "ISO3")
tm_shape(cluster_sf) +
tm_fill("hclust", palette = "Set2") +
tm_legend(legend.position = c("left", "bottom"))
We’ll now repeat this analysis with a self-organizing map (SOM), but using the full dataset. For this we’ll use the kohonen package, which allows the construction of unsupervised and supervised SOMs.
library(kohonen)
Building the SOM is relatively simple: first we construct the grid of nodes (30x20) using the somgrid() function. Note the topo argument that controls the neighborhood shape (rectangular or hexagonal):
som_grid = somgrid(xdim = 30, ydim = 20, topo = "hexagonal")
You can plot the empty grid to see the layout with plot(som_grid).
Now we’ll train the SOM. The function we use is som(), and we need to specify the data to be used as well as the SOM grid. The function requires the data to be as a matrix rather than an R data frame, so we first convert this, then pass it to the function:
gap_mat = as.matrix(gap_df_scale[,3:8])
gap_som = som(gap_mat, som_grid)
Once finished, we need to see if the algorithm has converged. We can do this by plotting the change in the loss function as a function of the iterations. As a reminder, the loss function here is the mean distance between the node weights and the observations assigned to that node:
plot(gap_som, type = "changes")
We can see here that the loss function has decreased but not stabilized, suggesting that the algorithm has not yet converged. We’ll re-run it for a larger number of iterations by setting rlen to 1000 (the default is 100; this may take a couple of minutes to run):
gap_som = som(gap_mat, som_grid, rlen = 1000)
plot(gap_som, type = "changes")
This plot shows a series of step-like drops in the loss function, finally stabilizing in the later iterations.
Next we’ll go through some visualizations of the data. We’ve already seen the loss function changes, so next we’ll make a couple of plots to show how the SOM has trained. First, a plot of distances between nodes, which can be useful to identify outliers:
plot(gap_som, type = "dist")
Next, we plot the number of observations per node to look for empty nodes:
plot(gap_som, type = "counts")
The map has a couple of empty nodes. Empty nodes are common when using larger grid sizes, and can be taken to represent a part of the parameter space that is missing from the observations.
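You can count the empty nodes directly from the unit assignments (a quick sketch; unit.classif holds the winning node index for each observation):
# Nodes that never appear in unit.classif received no observations
n_nodes <- nrow(gap_som$grid$pts)
sum(!(1:n_nodes %in% gap_som$unit.classif))   # number of empty nodes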
Next, we’ll plot the codebook vectors. These are the weights associated with each feature, and can be used to interpret the organization of the SOM. For each node, there is a radial plot showing the value of each feature associated with it. Note that the plot type adapts to the data: if you have a large number of features, each node is drawn as a line plot rather than a radial plot.
plot(gap_som, type = "code")
The map shows a clear gradient from richer regions (income, orange) on the top left to poorer on the right, and a gradient from smaller populations at the top to larger at the bottom. Note that life expectancy follows the income gradient fairly well, and the poorer regions tend to have higher child mortality and fertility rates.
It is also possible to plot the individual feature weights as heatmaps to illustrate these gradients, using type = "property". The codebook vectors of weights are contained within the gap_som object; the syntax to extract them is a little complicated, and some examples are given below:
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"population"],
main = "Pop")
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"life_expectancy"],
main = "Life Exp.")
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"income"],
main = "Income")
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"per_cap_co2"],
main = "Per capita CO2")
As a next step, we’ll cluster the nodes of the map to try to identify homogeneous regions and simplify the interpretation of the SOM. While \(k\)-means is often used here, we’ll use a hierarchical algorithm (for reasons that will become clear shortly) to form 6 clusters. Again, we first calculate the dissimilarity matrix, but this time between the values (codes) of each of the SOM nodes. Then we use the hclust() and cutree() functions to form the clusters:
dist_codes <- dist(gap_som$codes[[1]])
som_cluster <- cutree(hclust(dist_codes), 6)
We can now show this on the SOM grid using the plot() function, setting type to "mapping" and using the cluster values in som_cluster to set the colors. The cluster boundaries can also be overlaid with add.cluster.boundaries():
# Define a color palette
mypal = brewer.pal(6, "Set2")
plot(gap_som, type="mapping", bgcol = mypal[som_cluster],
main = "Clusters", pch = '.')
add.cluster.boundaries(gap_som, som_cluster)
The main problem with this is that the clusters are non-contiguous on the SOM grid. This reflects the non-linear mapping that a SOM carries out (and may also suggest that 6 clusters is too many). We can force the clusters to be contiguous by including the distance between the nodes on the SOM grid as well as the dissimilarity between codebook values. We’ve already calculated the latter above (as dist_codes), so let’s get the distance between the nodes using the handy helper function unit.distances():
dist_grid <- unit.distances(som_grid)
We then multiply these two distance/dissimilarity matrices together:
dist_adj <- as.matrix(dist_codes) * dist_grid
If we now repeat the clustering, the inclusion of the distances between nodes forces the clusters to be contiguous:
clust_adj = hclust(as.dist(dist_adj), 'ward.D2')
som_cluster_adj = cutree(clust_adj, 6)
plot(gap_som, type="mapping", bgcol = mypal[som_cluster_adj],
main = "Clusters", pch = '.')
add.cluster.boundaries(gap_som, som_cluster_adj)
Next, we obtain aggregate values for each cluster. While we can get these values from the gap_som object, we can also work back to get values from the original data (in gap_df). In the following code, we extract the SOM node assigned to each observation (unit.classif), use this to look up the cluster for that node, and append the result to the gap_df_scale dataframe as a factor (rather than an integer):
nodeByCountry <- gap_som$unit.classif
gap_df_scale$somclust <- as.factor(som_cluster_adj[nodeByCountry])
To get the values for each cluster on the original, non-transformed scale, we need to merge gap_df_scale (which contains the cluster assignments) with the original data (gap_df):
gap_df_clus <- merge(gap_df, gap_df_scale, by = c("country", "year"))
Now we can use this list of clusters by country to aggregate the variables:
# Take the grouping vector from the merged data frame so the rows align
cluster.vals <- aggregate(gap_df_clus[,3:8],
                          by = list(gap_df_clus$somclust), mean)
Which gives the following table:
| Group.1 | child_mortality.x | fertility.x | per_cap_co2.x | income.x | life_expectancy.x | population.x |
|---|---|---|---|---|---|---|
| 1 | 87.88370 | 4.671917 | 1.1386456 | 4211.982 | 63.49782 | 3656978 |
| 2 | 90.01528 | 3.485994 | 8.2206154 | 16783.211 | 62.25943 | 12652761 |
| 3 | 260.35413 | 5.254382 | 1.9952001 | 3908.525 | 43.81278 | 10217961 |
| 4 | 327.81105 | 6.320487 | 0.1437743 | 1464.744 | 36.87477 | 24639196 |
| 5 | 16.54619 | 2.002199 | 8.2011854 | 22716.948 | 74.59142 | 38317906 |
| 6 | 139.42213 | 5.523652 | 0.8609288 | 3765.815 | 57.94809 | 71481735 |
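If you prefer, the same summary can be produced with dplyr (a sketch; after the merge, the .x columns hold the original, untransformed values):
gap_df_clus %>%
  group_by(somclust) %>%
  summarize(across(ends_with(".x"), mean))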
This shows the approximate characteristics of the clusters: cluster 5 contains the wealthiest countries, with the highest incomes and life expectancy and the lowest fertility and child mortality, while cluster 4 contains the poorest, with the highest child mortality and fertility and the lowest incomes, life expectancy, and CO2 emissions. The remaining clusters fall between these extremes.
As we have data over multiple years, we can also use the results of the SOM analysis to track the trajectory of an individual country over time. To do this, we need to know which node a country was assigned to at each of its time steps: we find the row indices for the country, extract the corresponding node assignments, and then look up the coordinates of those nodes on the grid:
country_id <- which(gap_df_scale$country == "United States")
country_nodes <- gap_som$unit.classif[country_id]
country_crds <- som_grid$pts[country_nodes, ]
Now we plot out the grid again, shaded with the cluster assignments, and use R’s lines() function to overlay the country’s trajectory. In the last couple of lines, we build a vector of years to label the trajectory; as there would be too many labels, we keep only those at 20-year increments and then add them to the plot:
plot(gap_som, type="mapping", bgcol = mypal[som_cluster_adj],
main = "Clusters", pch = '', keepMargins = TRUE)
#add.cluster.boundaries(gap_som, som_cluster_adj)
lines(country_crds, lwd=2)
years <- ifelse(gap_df_scale$year[country_id] %% 20 == 0, as.character(gap_df_scale$year[country_id]), "")
text(jitter(country_crds), labels = years, cex = 0.75)
And finally, we can plot out the spatial distribution of these clusters for any given year in the dataset. First we need to assign the country codes to link to the shapefile:
gap_df_scale <- gap_df_scale %>%
mutate(ISO3 = countrycode(country, "country.name", "iso3c"))
Next we’ll extract a single year and merge it to our spatial object (cluster_sf):
myyear <- 2000
tmp.df <- gap_df_scale %>%
filter(year == myyear)
cluster_sf = merge(borders, tmp.df, by.x = "ADM0_A3", by.y = "ISO3",
all = TRUE)
And finally, plot it:
tm_shape(cluster_sf) +
tm_fill("somclust", palette = "Set2") +
tm_layout(main.title = paste("GAPMinder data", myyear))
## Warning: The shape cluster_sf contains empty units.
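(The warning simply reflects countries in the borders file with no data for the chosen year, which the all = TRUE merge keeps.) Since the cluster assignments cover every year, you can wrap these steps in a small function to map any year (a sketch using the objects created above):
map_year <- function(yr) {
  tmp.df <- gap_df_scale %>%
    filter(year == yr)
  cluster_sf <- merge(borders, tmp.df, by.x = "ADM0_A3", by.y = "ISO3",
                      all = TRUE)
  tm_shape(cluster_sf) +
    tm_fill("somclust", palette = "Set2") +
    tm_layout(main.title = paste("GAPMinder data", yr))
}
map_year(1950)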
For the exercise, you will need to use an unsupervised classification method (a self-organizing map) with the California housing dataset. Your answer should use the following variables (you can use code from a previous example to create the average number of rooms per house and the ratio of bedrooms to rooms; a sketch is also given at the end of this section):
- housing_median_age
- population
- median_income
- median_house_value
- avg_rooms
- bedroom_ratio

Your answer should consist of the following:
You do not need to provide a map of the SOM clusters, but if you do, the following code will be useful. First, extract the coordinates:
housing.crds = housing %>%
select(longitude, latitude)
Next, add the cluster assignment to the housing data frame, and convert to a Spatial* object. This assumes that the SOM output is stored in housing.som and the cluster assignment in som.cluster:
housing$cluster = as.factor(som.cluster[housing.som$unit.classif])
housing = cbind(housing, housing.crds)
coordinates(housing) = ~longitude+latitude
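As a reminder for the exercise, the two derived variables can be created before any of the steps above. A minimal sketch; the column names total_rooms, total_bedrooms, and households are assumed from the standard California housing dataset (adjust if yours differ):
# Column names below are assumed; check them against your housing data frame
housing <- housing %>%
  mutate(avg_rooms = total_rooms / households,          # average rooms per household
         bedroom_ratio = total_bedrooms / total_rooms)  # proportion of rooms that are bedrooms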